Goto

Collaborating Authors

 training data selection


Uncertainty Under the Curve: A Sequence-Level Entropy Area Metric for Reasoning LLM

arXiv.org Artificial Intelligence

In this work, we introduce Entropy Area Score (EAS), a simple yet effective metric to quantify uncertainty in the answer generation process of reasoning large language models (LLMs). EAS requires neither external models nor repeated sampling, it integrates token-level predictive entropy from the model itself to capture the evolution of uncertainty during generation. Empirical results show that EAS is strongly correlated with answer entropy across models and datasets. In training data selection, EAS identifies high-potential samples and consistently outperforms Pass Rate filtering under equal sample budgets, improving student model accuracy on math benchmarks. EAS is both efficient and interpretable, offering a practical tool for uncertainty modeling and data quality assessment in LLM training.


Sliced-Wasserstein Distance-based Data Selection

arXiv.org Artificial Intelligence

We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches. Our filtering technique is interesting for decision-making pipelines deploying machine learning models in critical sectors, e.g., power systems, as it offers a conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first approximation processes reduced-cardinality representations of the datasets concurrently. The second makes use of a computationally light Euclidian distance approximation. Additionally, we open the first dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.


Improving Radiography Machine Learning Workflows via Metadata Management for Training Data Selection

arXiv.org Artificial Intelligence

Most machine learning models require many iterations of hyper-parameter tuning, feature engineering, and debugging to produce effective results. As machine learning models become more complicated, this pipeline becomes more difficult to manage effectively. In the physical sciences, there is an ever-increasing pool of metadata that is generated by the scientific research cycle. Tracking this metadata can reduce redundant work, improve reproducibility, and aid in the feature and training dataset engineering process. In this case study, we present a tool for machine learning metadata management in dynamic radiography. We evaluate the efficacy of this tool against the initial research workflow and discuss extensions to general machine learning pipelines in the physical sciences.


Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks

Neural Information Processing Systems

In this paper, we consider the problem of active learning in trigonomet(cid:173) ric polynomial networks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By ana(cid:173) lyzing the condition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the compu(cid:173) tational complexity and memory required for the calculation of learning results. Finally, examples of sample points satisfying the condition are given and computer simulations are performed to demonstrate the effec(cid:173) tiveness of the proposed active learning method.


Sub-Setting Algorithm for Training Data Selection in Pattern Recognition

arXiv.org Machine Learning

Modern pattern recognition tasks use complex algorithms that take advantage of large datasets to make more accurate predictions than traditional algorithms such as decision trees or k-nearest-neighbor better suited to describe simple structures. While increased accuracy is often crucial, less complexity also has value. This paper proposes a training data selection algorithm that identifies multiple subsets with simple structures. A learning algorithm trained on such a subset can classify an instance belonging to the subset with better accuracy than the traditional learning algorithms. In other words, while existing pattern recognition algorithms attempt to learn a global mapping function to represent the entire dataset, we argue that an ensemble of simple local patterns may better describe the data. Hence the sub-setting algorithm identifies multiple subsets with simple local patterns by identifying similar instances in the neighborhood of an instance. This motivation has similarities to that of gradient boosted trees but focuses on the explainability of the model that is missing for boosted trees. The proposed algorithm thus balances accuracy and explainable machine learning by identifying a limited number of subsets with simple structures. We applied the proposed algorithm to the international stroke dataset to predict the probability of survival. Our bottom-up sub-setting algorithm performed on an average 15% better than the top-down decision tree learned on the entire dataset. The different decision trees learned on the identified subsets use some of the previously unused features by the whole dataset decision tree, and each subset represents a distinct population of data.


Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks

Neural Information Processing Systems

In this paper, we consider the problem of active learning in trigonometric polynomial networks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By analyzing the condition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the computational complexity and memory required for the calculation of learning results. Finally, examples of sample points satisfying the condition are given and computer simulations are performed to demonstrate the effectiveness of the proposed active learning method.


Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks

Neural Information Processing Systems

In this paper, we consider the problem of active learning in trigonometric polynomial networks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By analyzing the condition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the computational complexity and memory required for the calculation of learning results. Finally, examples of sample points satisfying the condition are given and computer simulations are performed to demonstrate the effectiveness of the proposed active learning method.


Training Data Selection for Optimal Generalization in Trigonometric Polynomial Networks

Neural Information Processing Systems

In this paper, we consider the problem of active learning in trigonometric polynomialnetworks and give a necessary and sufficient condition of sample points to provide the optimal generalization capability. By analyzing thecondition from the functional analytic point of view, we clarify the mechanism of achieving the optimal generalization capability. We also show that a set of training examples satisfying the condition does not only provide the optimal generalization but also reduces the computational complexityand memory required for the calculation of learning results. Finally, examples of sample points satisfying the condition are given and computer simulations are performed to demonstrate the effectiveness ofthe proposed active learning method. 1 Introduction Supervised learning is obtaining an underlying rule from training examples, and can be formulated as a function approximation problem. If sample points are actively designed, then learning can be performed more efficiently. In this paper, we discuss the problem of designing sample points, referred to as active learning, for optimal generalization. Active learning is classified into two categories depending on the optimality. One is global optimal, where a set of all training examples is optimal (e.g.